Benchmark multi-column GROUP BY performance by nathanb9 · Pull Request #22322 · apache/datafusion

nathanb9 · 2026-05-18T07:15:00Z

Which issue does this PR close?

Measuring: In general the multi group by option seems to make certain scenarios worse. #17850

Rationale for this change

Measure multi-column GROUP BY performance with a fair apples-to-apples comparison
The default vectorized per-column approach (GroupValuesColumn) underperforms the row-based approach (GroupValuesRows) when distinct group count is small relative
to input rows. The benchmark confirms this: row-based is 16-19% faster below ~200K groups, while vectorized wins by 15-33% above ~500K groups.

What changes are included in this PR?

Adds a benchmark in datafusion/physical-plan/benches/multi_group_by.rs that directly calls GroupValues::intern() with identical Int32 data for both implementations —
no SQL/planning/IO overhead, same schema, same hashing.

Makes mod row public so the benchmark can instantiate GroupValuesRows directly.

Test cases:

Issue In general the multi group by option seems to make certain scenarios worse. #17850 reproduction (3 cols, 64 groups, 1M-50M rows) — confirms row-based wins ~16-19%
Low cardinality sweep (8-4096 groups, 3-4 cols) — row-based wins 15-38%
Batch size sensitivity (1K-32K) — minimal effect on ratio
Column scaling with low groups (2-10 cols) — row-based advantage grows with columns
High cardinality scaling (1M groups, 2-10 cols) — vectorized wins 22-43%
Group count sweep (16 to 1M groups, 4 cols) — crossover at ~200K-500K groups

Are these changes tested?

cargo fmt --all
cargo clippy -p datafusion-physical-plan --bench multi_group_by -- -D warnings
cargo bench -p datafusion-physical-plan --bench multi_group_by

Are there any user-facing changes?

No. This adds a benchmark only.

kosiew

@nathanb9
Thanks for putting this benchmark together.
I think there are a few issues that make it hard to validate the conclusions the PR is trying to draw around GroupValuesColumn vs GroupValuesRows behavior.

kosiew · 2026-05-19T09:54:38Z

+
+fn generate_parquet_file(num_cols: usize, cardinality: usize) -> NamedTempFile {
+    let mut rng = StdRng::seed_from_u64(42);
+    let fields: Vec<Field> = (0..num_cols)


Right now the benchmark only generates Int32 grouping columns, so DataFusion will always pick the vectorized GroupValuesColumn path for these multi-column GROUP BYs.

The PR description talks about measuring when the default per-column implementation starts to lose against the row-based GroupValuesRows implementation, but this benchmark never exercises the row-based path, so it cannot really validate the claimed crossover point.

Could we add a second benchmark variant that explicitly uses GroupValuesRows, or benchmark the two GroupValues implementations directly? That would make it possible to compare both implementations across the same cardinality and column-count combinations.

but this benchmark never exercises the row-based path, so it cannot really validate the claimed crossover point.

Yes, the way I benchmarked is by making code change to enforce one or the other and run them separately.

Addressed. I added a new benchmar that directly instantiates both GroupValuesColumn (vectorized) and GroupValuesRows (row-based) and calls their intern() method with identical Int32 data

kosiew · 2026-05-19T09:54:38Z

+}
+
+#[expect(clippy::needless_pass_by_value)]
+fn query(ctx: Arc<Mutex<SessionContext>>, rt: &Runtime, sql: &str) {


It looks like each Criterion iteration calls ctx.sql(sql) before df.collect(), which means planning and logical/physical optimization are currently included in the measured path.

The module docs say the query is pre-planned and that the benchmark is only measuring execution, so this seems a bit inconsistent. It also makes the column-count experiment harder to interpret because planning cost grows with the generated projection and grouping expressions.

Could we construct the plan once during setup and only time repeated execution of that plan? Otherwise it would help to update the benchmark naming/docs to make it clear that planning is intentionally included.

Instead of pre-planning a physical plan (which can't be re-executed due to internal state), the new benchmark bypasses SQL/planning entirely by calling GroupValues::intern() directly on pre-generated in-memory batches.

kosiew · 2026-05-19T09:54:38Z

+        b.iter(|| query(b_ctx.ctx.clone(), &rt, &build_group_by_sql(4)))
+    });
+
+    let b_ctx = prepare_context(&rt, 4, 30); // 30^4 = 810K groups


Some of the benchmark names/comments currently report the theoretical key-space size rather than the number of distinct groups actually produced.

With around 1M input rows, the 30^4 = 810K case only produces roughly ~574K observed groups in expectation, and the 100^4 / 500^4 cases are effectively capped near the input row count instead of producing 100M or 62B distinct groups.

Since the PR is trying to identify a distinct-group-count threshold, these labels end up being a bit misleading. Could we either generate data with controlled exact NDV per case, or rename/report these as key-space cardinalities and include the measured distinct counts alongside them?

Just addresed by using sequential enumeration (global_row % num_distinct_groups decomposed per-column) which guarantees the exact number of distinct groups matches the label. No random sampling

@kosiew

…g overhead, exact NDV Addresses all three review comments from @kosiew: 1. **Implementation comparison**: Benchmarks both GroupValuesColumn (vectorized, via Int32 columns) and GroupValuesRows (row-based, via FixedSizeBinary(4) columns that trigger the fallback path) side-by-side. 2. **Execution-only timing**: Pre-optimizes the logical plan once via `df.into_parts()`. Each benchmark iteration only does physical planning + execution, excluding SQL parsing and logical optimization. 3. **Exact cardinality**: Replaces random sampling with sequential enumeration (`global_row % num_distinct_groups` decomposed per-column), guaranteeing precise distinct group counts with no birthday-paradox error. Additionally motivated by apache#17850, adds comprehensive experiments: - Issue apache#17850 regression reproduction (3 cols, 64 groups, 1M-50M rows) - Low cardinality sweep (8-4096 groups) - Batch size sensitivity (1K-32K) - Column count scaling (2-10 cols, low and high cardinality) - Group count sweep (16 to 1M groups) - Random vs sequential data patterns Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Adds a fair apples-to-apples benchmark that directly calls GroupValues::intern() with identical Int32 data for both GroupValuesColumn (vectorized) and GroupValuesRows (row-based). This eliminates the previous confounding factors (different data types, SQL/planning overhead) and confirms the regression reported in apache#17850: row-based is 16-19% faster at low cardinality (64 groups), with a crossover at ~200K-500K groups where vectorized becomes faster. Experiments: - Issue apache#17850 reproduction (3 cols, 64 groups, 1M-50M rows) - Low cardinality sweep (8-4096 groups) - Batch size sensitivity (1K-32K) - Column count scaling (2-10 cols) - High cardinality scaling (1M groups) - Group count sweep (16 to 1M groups) - Random vs sequential data patterns

nathanb9 · 2026-05-20T18:11:09Z

 pub mod multi_group_by;

-mod row;
+pub mod row;


Required for the benchmark to directly instantiate GroupValuesRows with the same Int32 schema used by GroupValuesColumn. Without this, benchmarks can only trigger the row-based path via an unsupported type, which makes the comparison unfair.

nathanb9 · 2026-05-20T18:50:56Z

Thanks for the review @kosiew.

Changes from your comments:

Both GroupValuesColumn and GroupValuesRows are now benchmarked directly via the intern() API with identical Int32 data
Benchmark bypasses SQL/planning entirely
Exact cardinality via sequential enumeration. Labels now match actual distinct group counts.

One thing im not sure about: making mod row public. This is needed because new_group_values() with an Int32 schema will always route to GroupValuesColumn. There's no way to make it pick GroupValuesRows for Int32 without. I can make it so only thing pub is pub use row::GroupValuesRows;

kosiew

@nathanb9
Thanks for the updates here. The benchmark changes address the earlier concerns around comparing GroupValuesColumn vs GroupValuesRows directly and removing SQL planning overhead from the measured path.

I still found one issue with the 1M-group benchmark cases. Because generate_batches truncates to whole batches, the generated input for the advertised 1_000_000 row/group cases is actually smaller than the benchmark labels suggest.

kosiew · 2026-05-22T05:17:02Z

+        .powf(1.0 / num_cols as f64)
+        .ceil() as usize;
+
+    let num_batches = num_rows / batch_size;


generate_batches currently computes num_batches = num_rows / batch_size, which drops the remainder rows.

With the default 8192 batch size and num_rows = 1_000_000, the benchmark generates only 122 * 8192 = 999_424 rows. That means the grp_1M / cols_*_grp_1M cases can produce at most 999_424 observed groups, not 1,000,000.

So while the sequential enumeration change fixes most of the earlier cardinality-label mismatch, the exact 1M threshold cases are still slightly off.

Could we either generate the final partial batch, or adjust/report the benchmark labels based on the actual generated row/group counts?

fixed now I get the remainder as well

kosiew · 2026-05-22T05:17:02Z

 pub mod multi_group_by;

-mod row;
+pub mod row;


On the question about pub mod row: if the public access is only needed for the benchmark, I think it would be cleaner to keep mod row; private and expose only the narrow surface that is needed, for example pub use row::GroupValuesRows;.

That keeps the internal module layout private instead of publishing the full module path just for benchmark access.

Sounds good. pub use row::GroupValuesRows;

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

kosiew

@nathanb9
Thanks for the iteration.

Looks 👍 to me

nathanb9 · 2026-05-23T02:21:34Z

@kosiew Thanks for your thorough review

Add multi-column group by benchmark

d25b053

github-actions Bot added the core Core DataFusion crate label May 18, 2026

nathanb9 marked this pull request as ready for review May 18, 2026 08:19

nathanb9 mentioned this pull request May 18, 2026

In general the multi group by option seems to make certain scenarios worse. #17850

Open

kosiew requested changes May 19, 2026

View reviewed changes

Nathan Bezualem and others added 2 commits May 20, 2026 10:58

github-actions Bot added the physical-plan Changes to the physical-plan crate label May 20, 2026

nathanb9 commented May 20, 2026

View reviewed changes

Remove end-to-end SQL benchmark in favor of direct intern() benchmark

c5582f3

github-actions Bot removed the core Core DataFusion crate label May 20, 2026

Remove random vs sequential experiment

bc4e236

nathanb9 requested a review from kosiew May 20, 2026 18:51

kosiew requested changes May 22, 2026

View reviewed changes

Fix partial batch truncation and narrow row module visibility

d8d6e98

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

nathanb9 requested a review from kosiew May 22, 2026 12:19

kosiew approved these changes May 23, 2026

View reviewed changes

Dandandan added this pull request to the merge queue May 23, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 23, 2026

Dandandan added this pull request to the merge queue May 23, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 23, 2026

Conversation

nathanb9 commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

kosiew left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nathanb9 commented May 20, 2026

Uh oh!

kosiew left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

kosiew left a comment

Choose a reason for hiding this comment

Uh oh!

nathanb9 commented May 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

nathanb9 commented May 18, 2026 •

edited

Loading

kosiew left a comment •

edited

Loading

kosiew left a comment •

edited

Loading

nathanb9 commented May 23, 2026 •

edited

Loading